Abstract

Community detection is one of the most investigated problems in the
field of complex networks. Although several methods have been proposed, there
is still no precise definition of communities. As a step towards a definition,
I highlight two necessary properties of communities, separation and internal
cohesion, the latter being a new concept. I propose a local method of
community detection based on two-dimensional local optimization, which
I tested on common benchmarks and on the word association database.

1 Introduction 

In the last decade, interdisciplinary research on complex networks has resulted in
spectacular development. It has become clear that networks constructed
from diverse complex systems show remarkably similar features. Several aspects
have been investigated, such as clustering [7], the degree distribution [8], diameter [9], [10],
spreading processes [11], diffusion [12], synchronization [13], critical phenomena
[14] and game theoretical models on complex networks [15].

One of the most actively researched questions about complex networks is
community detection [16]. Community detection aims at finding dense
groups in graphs, like circles of friends in social networks, web pages about the
same topic, or substances appearing in the same pathway in metabolic reaction
networks. Perhaps the strongest motivation behind the research is that dense
groups in the topology are expected to correspond to functions performed by the
network, such that one can infer function from pure topology. While the concept
of communities seems intuitively plausible, attempts at an algorithmically
useful definition have not been successful yet. The global characterization by
modularity [17] or by random walks, the local "weak" and "strong" definitions,
the clique percolation approach, and the multiresolution methods
have all increased our understanding of this complex problem, but
the proliferation of community detection methods only indicates the difficulty
of this issue [16].

Unfortunately, a precise definition of communities is still lacking, giving
rise to innumerable methods using different definitions. The lack of a definition also
makes the testing of methods problematic, although there is progress on this
issue [25], [26]. The difficulty of the problem is increased by more subtle factors: very
often communities occur on a broad scale, they can be ordered in a hierarchical
manner, and they may overlap, all of which makes their identification even harder.

After several years of active research, it is becoming clear
that community detection involves the following stages:

1. defining the term "community";

2. finding the objects corresponding to the definition;

3. determining the significance of the found communities.

Although from the theoretical perspective stage 1 is clearly a key issue, it is
far from being settled. Several different proposals exist, which are evaluated
mostly according to their results on a few benchmarks. This is the stage that
this paper primarily aims to improve. Stage 2 is a technical issue, often
consisting of some combinatorial optimization method. Its choice is usually the
result of a trade-off between speed and quality. Stage 3 should give information
about how surprising the existence of a found community is in the actual
graph, given some characteristics of the graph such as edge density or degree
distribution. Although this issue has also received some attention [27]-[33], it has only begun to
see widespread application [34].

The rest of the paper will focus on the question of definition, so a few remarks
about stage 3 are made here. Most community detection methods give no
information about the significance of their output, thus forcing the investigator
to assume that all results are (equally) significant. This way, the community
detection stages 2 and 3 are combined into a single decision on whether a
particular subgraph is a good enough community or not, effectively omitting the
significance test in practice. The other end of the spectrum, represented by [34],
builds the definition of communities on statistical significance, which is clearly
an improvement. However, it should be noted that the fitness and the statistical
significance of a subgraph as a community are not synonyms. Statistical significance
tells us how surprising a subgraph is, while fitness tells us how close
it is to the ideal community. Therefore, the two quantities are complementary
and both belong to the description of a community.

2 Local criteria for communities 

A fundamental problem of community detection is to define the term "commu- 
nity". There are different approaches to this question. One is the algorithmic 
approach, giving a computational procedure for finding clusters. This naturally 
incorporates a mathematically precise definition, although different algorithms 
usually result in diverse definitions, and there is no theoretical framework cur- 
rently to help their differentiation. Another possibility is to present a general 
concept, on which a precise definition can be based. In this paper, the latter 
approach is taken, although an algorithmic realization is also presented. 

No definition of communities which is both precise and generally accepted
has appeared yet. Currently the description of communities is exhausted by the
phrase "nodes having more edges among themselves than to the rest of the
graph" (or equivalent forms). It can be translated roughly to "statistically
significant locally dense subgraphs". Statistical significance is a quite precise
expression; the main problem is with the term "locally dense". As an intuitive
picture it is quite good, but it is far from being directly transformable to algorithms.
Although there is an implicit agreement that clearly counterintuitive results
are not permitted, even a formal list of required properties is missing. However,
there are some properties which fit human intuition about locally dense
subgraphs:

Separation: a good community is well-separated from the rest of the graph;

Cohesion: a good community is homogeneously well connected inside, i.e. it is
hard to separate into two communities.

The separation criterion is quite clear, although there is an important remark:
separation should be defined locally, involving only the community under
investigation and its immediate neighborhood. Global methods, in which distant
regions of the graph can modify a community in order to improve a global
fitness value, can produce results violating the human perception of clusters.
A famous example is the resolution limit of modularity [35].

Although separation is a very intuitive criterion, and famous methods rely
on it (see the Appendix), it is not enough in itself. Figs. 1a and 1b illustrate that the
distribution of links inside the separated region (the "shape" of the subgraph)
also matters heavily. Application of current community detection methods to
real-world networks confirms that this is a real problem, e.g. tree-like communities
can occur even when the whole network is not tree-like [37], [38].

Both separation and cohesion are required properties of communities. If one
neglects cohesion, the result may contain clusters like the one in Fig. 1a. On the
other hand, if separation is not taken into account, one may end up chopping
a separated subgraph until very cohesive pieces (cliques in the extreme) are
obtained, like the triangles in Fig. 1b.

Given the subgraphs in Figs. 1a and 1b as proposed communities, the fitness
values of most community detection methods, to be reviewed in the next section,
cannot tell the difference between them. This is because most methods
simply count the internal and/or external edges, which say nothing about the
distribution of those edges. The reason why several methods nevertheless assign
proper clusters to Fig. 1a is that they look for optimal clusters; consequently
they compare configurations like Fig. 1a in one cluster and in two clusters, and
splitting the two cliques into two clusters may improve the partition. But the
situation is even worse. In the next Section, we will see that a number of fitness
functions score a counterintuitive clustering higher than the intuitive
one (e.g. modularity prefers joining the two cliques of Fig. 1a in a large enough
graph). It should be noted that in such a case, the proper communities might
still be recovered if the heuristic gets stuck in the proper local optimum, even when
that is not the global optimum.



^ For brevity, the words "community" , "group" and "cluster" will be used from this point as 
synonyms for "locally dense subgraph" , omitting the statistical significance from the meaning. 

^It should be noted that the meaning of the term "community" can depend on the context; 
consequently a single definition may not be enough. Here the aim is to describe a particularly 
intuitive one. 

^The term "cohesion" also appeared in [22], although there it denotes an
unrelated quantity.
(a) (b)

Figure 1: Illustration of the importance of subgraph shape. The two subgraphs 
have the same number of nodes and the same degrees, i.e. they differ only in 
the distribution of links. The figure on the left is much less cohesive than the 
figure on the right, although just a reorganization was applied to the links. 

3 Overview of the existing methods 

Here, existing community detection methods will be reviewed from the point of
view of the previous section, i.e. whether they conform to the criteria of separation
and cohesion. As mini-benchmarks, the examples in Fig. 1 or their simple variations
will be used (see the Appendix for details on individual methods). The
desired output for Fig. 1a is two communities consisting of the two cliques, while Fig. 1b
should be kept in one piece. In both cases, no nodes from the rest of the graph
should be included. For methods optimizing a fitness function, the globally optimal
solution will be considered; for other methods, the possible solutions. These
solutions will be compared to the desired ones, independently for Fig. 1a and
Fig. 1b. If a method separates the two cliques of Fig. 1a, then it gets a +; if it
puts all nodes of Fig. 1b into one cluster, then it gets another +. If there
are multiple equally valid solutions (as for label propagation), all solutions are
required to conform to the preferred result.

For methods optimizing a function, the heuristic realizing the optimization may
deviate from the global optimum, presenting worse or even better results (in
terms of conformity to separation and cohesion). This will not be investigated;
here the focus is on the definition of the communities (following from the choice
of the fitness function), not on the practical aspects. Results for methods which
can produce a single partition or cover are displayed in Table 1. The large number
of published methods makes assembling a complete list nearly impossible.
Instead, the emphasis is put on the diversity of the reviewed approaches.

There are a number of multiresolution methods, which possess a parameter
that allows tuning the cluster sizes from 1 (isolated nodes) to O(N): the multiresolution
modularity of Reichardt and Bornholdt (RB) [52], of Arenas, Fernandez
and Gomez (AFG) [23], the local fitness method of Lancichinetti, Fortunato
and Kertesz (LFK) [39], the Potts model of Ronhovde and Nussinov (RN) [43],
the Markov autocovariance stability of Delvenne, Yaliraki and Barahona (MAS),
the hierarchical likelihood method of Clauset, Moore and Newman (CNM)
[55], and the Markov Cluster Algorithm of van Dongen (MCL) [57]. Naturally,
these methods are expected to find the proper community assignments both to
method

cohesion test
(like Fig. 1a)

separation test
(like Fig. 1b)

Lancichinetti et al. [39]
Labelpropagation [40]
Infomap [15]
Clique Percolation
Estrada & Hatano
Modularity optimization [17]
Donetti & Munoz [42]
Ronhovde & Nussinov [43]
Nepusz et al.
Hofman & Wiggins
Hastings
Newman & Leicht [47]
Wang & Lai [48]
Bickel & Chen
Karrer & Newman
Infomod
Radicchi et al. [20]
Chauhan et al. [52]
Evans & Lambiotte
Ahn et al. [51]
ModuLand



+ 



+ 
+ 



+ 



+ 



Table 1: Cohesion & separation criterion test results. Tests were done on Figs.
1a and 1b or similar graphs (which are described in the Appendix). + and - are
assigned according to whether or not the fitness function of a method is optimal
for the preferred solution. For methods which do not optimize a fitness
function, the possible solution(s) were simply analyzed. See the Appendix
for details on specific methods.



Fig. 1a and 1b at some parameter values. However, there is no guarantee that
these values are also the proper ones for the rest of the graph. Consequently, it
is not clear how a resolution parameter should be set: the natural idea is to find
the longest interval of the resolution parameter value in which the community
structure does not change, but when the optimal parameter value is different for
different regions in the graph, the longest stable interval does not necessarily reflect
the optimal communities.

Furthermore, the fitness values do not help us to tell good clusters from bad
ones, like Fig. 1a from Fig. 1b. For most multiresolution methods (RB, AFG,
LFK, RN), it is very easy to see that the fitnesses of two clusters are the same
given that all nodes have the same in- and outdegrees, independently of the shape
of the clusters. Note that this is also true for most single resolution methods. For
MAS it is not trivial; therefore, empirical tests were conducted to check it.
According to them, Fig. 1a was found to be at least as good as
Fig. 1b. Finally, regarding MCL and CNM, they have no fitness function; the only
accessible quantity about the community structure is the parameter interval in
which it is stable.

Finally, there are hierarchical methods, which look for series of smaller and 
smaller (or larger and larger) clusters hierarchically embedded into the previ- 
ous ones. Similarly to multiresolution methods, they are expected to contain 
good clusters in the outputted hierarchy. However, when looking at a graph 
having a simple one-level community structure, the question how to select the 
proper levels of the outputted hierarchy arises. The easiest way is to use the 
lowest level communities. Unfortunately, it is not a reliable procedure, as the 
lowest-level clusters may be just parts of the communities of the optimal par- 
tition or cover (see the Appendix for details). A second idea can be to assign 
significance scores to the communities on different levels, in the spirit of [53]. 
Although this approach might reliably qualify the found communities, a new 
version of statistical significance taking into account the internal cohesion is 
required. Furthermore, one should be very careful not to impose unnecessary 
constraints, like prohibiting overlaps, when constructing a hierarchical method. 

A further question is whether a method provides information about the
shape of the found communities or not. Recent analysis of real-world networks
highlights the relevance of this issue [37], [38]. Several methods are based on
simply counting the internal and/or external edges, or degrees at most: LFK,
Labelpropagation, Infomap, modularity optimization (and equivalents), Hofman
& Wiggins, Hastings, Ronhovde & Nussinov, Newman & Leicht, Wang & Lai,
Bickel & Chen, Karrer & Newman, Infomod, Ahn et al., OSLOM. Consequently,
they do not see any difference in the distribution of the links, e.g. Fig. 1a and
Fig. 1b get the same fitness values. Only Clique Percolation and the method of Radicchi et al.
have some very limited requirement about cohesion built into the definition
of communities.

The conclusion is that none of the reviewed methods is able to successfully
apply both the separation and the cohesion criteria. They are susceptible either to
gluing together well-separated subgraphs or to overpartitioning a cohesive subgraph.
Future network designs should consider cohesion as well as separation.



4 Community detection in a two dimensional 
parameter space 

In this Section, a new method for community detection is introduced. Its main
goal is to take into account both criteria defined in
Sec. 2. First, the LFK method will be reviewed, which will serve as a starting
point for the new method. Then, a composite fitness will be constructed which
takes into account the separation and cohesion criteria. Finally, a heuristic
optimization procedure for the composite fitness will be described, which finds

^In this case, only 1 link to the rest of the graph was used. Rest of the graph, represented by 
a single node having self-loops, was assigned 118 edges inside, resulting in a total of L = 150 
edges. Stability values were calculated from 0.01 to 100, the step size being 0.01 below 1.0 
and 1 above. 

^CNM does have a fitness function, but it corresponds to a full hierarchical dendrogram,
not to any partitions obtained by cutting the dendrogram at some point.

^If the stopping criterion of their heuristic is considered as part of the definition.
locally dense subgraphs on all scales, and is also able to recover hierarchical
structures.

4.1 The LFK method 

The LFK method [39] optimizes the local fitness function

    $f_C = \frac{K_{in}^C}{\left(K_{in}^C + K_{out}^C\right)^{\alpha}}$,    (1)

where C denotes a subgraph, $K_{in}^C$ and $K_{out}^C$ are the total number of inside and
outside degrees in C, respectively, and $\alpha$ is a tunable exponent for setting the
size scale of the communities to be found. Running the method with large $\alpha$
values results in small clusters, small values in large clusters. The recommended
range for $\alpha$ is 0.5-2.
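As a minimal sketch of how this local fitness can be evaluated, the following function assumes the usual LFK form, K_in / (K_in + K_out)^alpha, and a graph stored as a dictionary of neighbour sets. The function name `lfk_fitness`, the input format and the small example graph are illustrative assumptions, not part of the original method description.

```python
def lfk_fitness(adj, community, alpha=1.0):
    """Local fitness K_in / (K_in + K_out)**alpha of a node set.

    adj: dict mapping each node to the set of its neighbours (undirected,
         unweighted graph; a hypothetical input format chosen for this sketch).
    """
    members = set(community)
    k_in = 0   # degree endpoints pointing inside the community
    k_out = 0  # degree endpoints pointing outside the community
    for node in members:
        for neighbour in adj[node]:
            if neighbour in members:
                k_in += 1
            else:
                k_out += 1
    total = k_in + k_out
    return 0.0 if total == 0 else k_in / total ** alpha


# Illustrative graph: two 4-cliques joined by the single edge 3-4.
adj = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
    4: {3, 5, 6, 7}, 5: {4, 6, 7}, 6: {4, 5, 7}, 7: {4, 5, 6},
}
print(lfk_fitness(adj, {0, 1, 2, 3}))             # one clique, alpha = 1
print(lfk_fitness(adj, {0, 1, 2, 3}, alpha=2.0))  # larger alpha penalises size
```

Each internal edge contributes twice to K_in (once per endpoint), which matches the reading of K_in as a sum of inside degrees.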

The practical implementation of the optimization works as follows. The com- 
munities are found one-by-one, independently of each other. First, a seed node 
is selected from which the new community will be grown. Then, the node which 
can best improve the fitness of the cluster is added. This addition is repeated 
until the fitness reaches a local optimum. After each addition, removal of nodes 
takes place, if the fitness can be enhanced that way. When the fitness cannot 
be further increased, the actual subgraph is declared a community. The growth 
process is repeated for all nodes as seeds, or alternatively, until the found com- 
munities cover all nodes in the graph. 

Although the resolution parameter $\alpha$ can be tuned continuously, [39] suggested
that the relevant community structures should be identified by their robustness
to changes in $\alpha$, i.e. as those which persist over the longest interval of $\alpha$ values without
change. Changes in the community structure were detected by monitoring the
mean fitness of the communities, evaluated at a reference value $\alpha = 1$.
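The stability criterion can be illustrated with a short sketch. It assumes that the resolution parameter has been sampled on an increasing, non-empty grid and that the mean fitness at the reference value has been recorded for each sample; the function name `longest_stable_interval`, the tolerance and the sample numbers are hypothetical choices for this example.

```python
def longest_stable_interval(alphas, mean_fitness, tol=1e-9):
    """Return (start, end) of the longest run of alphas with unchanged mean fitness.

    alphas:       increasing, non-empty list of sampled parameter values
    mean_fitness: mean community fitness at the reference value, one per alpha
    """
    best = (alphas[0], alphas[0])
    start = 0
    for i in range(1, len(alphas) + 1):
        # A run ends at the last sample or where the monitored value changes.
        changed = i == len(alphas) or abs(mean_fitness[i] - mean_fitness[i - 1]) > tol
        if changed:
            if alphas[i - 1] - alphas[start] > best[1] - best[0]:
                best = (alphas[start], alphas[i - 1])
            start = i
    return best


# Made-up monitoring data, for illustration only.
alphas = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1]
fitness = [0.4, 0.4, 0.7, 0.7, 0.7, 0.7, 0.9]
print(longest_stable_interval(alphas, fitness))
```

Monitoring a single scalar is only a proxy for "the community structure does not change"; two different structures could in principle share the same mean fitness.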



4.2 Implementing the criteria

For the separation criterion, the following function will be applied:

    $f_s^C = \frac{K_{in}^C}{K_{in}^C + K_{out}^C}$,    (2)

where C is a subgraph, and $K_{in}^C$ and $K_{out}^C$ are the sums of in-community and
out-community degrees, respectively. This is the fitness of LFK [39], with the
multiresolution parameter being set to one. For detecting hierarchical structures, a
different solution will be described. Eq. (2) clearly focuses on the external separation
of the clusters, therefore it is suitable as an implementation of the first
criterion of the communities.

For the internal cohesion criterion, a possible solution is to consider the second
eigenvalue of the Laplacian matrix of the community. The Laplacian of a
graph is the matrix $L = A - D$, where A is the adjacency (or weight) matrix, and
$D = \mathrm{diag}(k_i)$ is a diagonal matrix containing the degrees (strengths). Its largest
eigenvalue is always 0 (corresponding to the trivial eigenvector (1, 1, ..., 1)).
The multiplicity of the largest eigenvalue equals the number of connected
components in the graph. This gives the hint that if two distinct graphs get
connected by a single (weak) link, the Laplacian gets only a slight perturbation
(compared to the case of two connected components), which splits the double
degeneracy of the first eigenvalue, such that a new eigenvalue close to zero appears.
In fact, it is known that the second eigenvalue of the Laplacian measures
"how difficult is to split the graph into two large pieces" [58].

For some important special cases the second eigenvalue can be calculated:

- for full graphs of n nodes (clique), $\lambda_2 = -n$

- for a star-graph, $\lambda_2 = -1$, independently of n

- for a linear chain, $\lambda_2 = -2 + 2\cos(\pi/n) \to 0$ as $n \to \infty$

- for two n-sized cliques attached by a single link (having weight $\epsilon$) (like on
Fig. 1a), $\lambda_2$ also goes to 0 as $n \to \infty$

- for a disconnected graph, $\lambda_2 = 0$. This may seem trivial, but most methods
give a finite score for disconnected communities; it is not without precedent
that such objects can be produced in reality [38]. Although this problem
can be avoided by a properly designed heuristic of a method, disconnected
communities should be punished by definition.
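The special cases listed above can be checked numerically. The sketch below builds the Laplacian in the sign convention L = A - D and takes its second largest eigenvalue. To stay dependency-free it uses a small cyclic Jacobi rotation routine, which is only a stand-in for a library eigensolver; the helper names `symmetric_eigenvalues` and `lambda2` and the edge-list format are assumptions made for this example.

```python
import math


def symmetric_eigenvalues(matrix, sweeps=50):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations (ascending)."""
    n = len(matrix)
    a = [list(row) for row in matrix]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < 1e-22:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that annihilates the (p, q) entry.
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))


def lambda2(n, edges):
    """Second largest eigenvalue of L = A - D for an undirected weighted graph."""
    lap = [[0.0] * n for _ in range(n)]
    for u, v, w in edges:
        lap[u][v] += w
        lap[v][u] += w
        lap[u][u] -= w
        lap[v][v] -= w
    return symmetric_eigenvalues(lap)[-2]


n = 6
clique = [(i, j, 1.0) for i in range(n) for j in range(i + 1, n)]
star = [(0, j, 1.0) for j in range(1, n)]
chain = [(i, i + 1, 1.0) for i in range(n - 1)]
second_clique = [(i + n, j + n, w) for i, j, w in clique]
two_cliques = clique + second_clique + [(n - 1, n, 1.0)]  # joined by one unit link

for name, size, edges in [("clique", n, clique), ("star", n, star),
                          ("chain", n, chain), ("two cliques", 2 * n, two_cliques)]:
    print(name, round(lambda2(size, edges), 3))
```

In practice a numerical library routine for symmetric eigenvalue problems would replace the Jacobi helper; only the construction of L = A - D and the choice of the second largest eigenvalue matter for the argument.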

Calculation for the two cliques is in the Appendix, other results can be found 
in [SS]- These cases confirm that the second eigenvalue is useful for quantifying 
the cohesion criterion of the definition of communities. For an illustration, on 
Fig. 2 a few example graphs with their second Laplacian eigenvalues are shown.





(a) (b) (c) (d)

Figure 2: Graphs with different second Laplacian eigenvalues. $\lambda_2^{(a)} = 6$, $\lambda_2^{(b)} = 2$,
$\lambda_2^{(c)} = 1$, $\lambda_2^{(d)} = 0.268$. The maximal value of $\lambda_2$ is 6 in all cases.



The separation fitness term $f_s$ ranges from zero to one. In order to combine
it with the cohesion fitness, the latter should also be in the interval
[0, 1]. Therefore, $\lambda_2$ needs some transformations before being applied as a fitness.
As can be seen from the above examples, for the worst cases $|\lambda_2|$ is of the order
of 1/n, therefore the lowest point of the $|\lambda_2|$-scale will be set to 1/n. The highest
point is trivially given by n. It is reasonable to assume that most subgraphs have
$|\lambda_2| = o(|C|)$. Furthermore, several subgraphs can have worse internal cohesion

^The diffusion matrix was also considered, but it prefers star-like graphs too much. 
than the star graph, thus having $|\lambda_2| \in [0, 1]$. To take into account these effects,
$\log|\lambda_2|$ will be more useful than $\lambda_2$. So, in order to obtain a quantity between
0 and 1, the minimum will be subtracted and the result divided by its maximum:

    $f_c^C = \frac{\log|\lambda_2| - \log(1/|C|)}{\log|C| - \log(1/|C|)} = \frac{1}{2} + \frac{\log|\lambda_2|}{2\log|C|}$  if $|C| > 1$,
    $f_c^C = 0$  if $|C| = 1$,    (3)

where |C| is the number of nodes in the community. The above measure happens
to be 0.5 if $|\lambda_2|$ is 1, e.g. for the star-graph. I wish to emphasize that eq. (3) is only
one possible proposal for taking into account the internal cohesion, although
a promising one; better measures may exist. The same is true for the choice of
$f_s$.
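A minimal sketch of the cohesion fitness follows, assuming the form 1/2 + log|lambda_2| / (2 log|C|). The function name `cohesion_fitness`, the value 0 returned for a single node and the treatment of a disconnected subgraph (lambda_2 = 0) as minus infinity are assumptions of this sketch rather than statements of the original text.

```python
import math


def cohesion_fitness(lambda2, size):
    """Cohesion fitness 1/2 + log|lambda2| / (2 log|C|) for a subgraph of `size` nodes."""
    if size <= 1:
        return 0.0  # assumed value for a single node, where the formula is undefined
    magnitude = abs(lambda2)
    if magnitude == 0.0:
        return -math.inf  # disconnected subgraph: limit of the formula (assumption)
    return 0.5 + math.log(magnitude) / (2.0 * math.log(size))


print(cohesion_fitness(-1.0, 6))                              # star graph
print(cohesion_fitness(-6.0, 6))                              # 6-clique
print(round(cohesion_fitness(-3.268, 12), 3))                 # 12-node subgraph
print(round(cohesion_fitness(-(4 - math.sqrt(14)), 12), 3))   # two 6-cliques, one link
```

The sign of lambda_2 is irrelevant here because only its magnitude enters the formula, so the function accepts either sign convention for the Laplacian.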

The cohesion fitness opens the way for constructing tests assessing the
performance of community detection methods regarding the cohesion of the
found communities. One may generate a graph with built-in communities whose
separation is controlled, as in the LFR benchmark [25], then randomly select
pairs of clusters and increase the interconnection between the two members of
each pair to some predefined value, finally calculating $f_c$ of the pairs. Running
the detection method and measuring the ratio of pairs not split as a function of
$f_c$ may indicate how strongly the method focuses on cohesion.

The next question is how to combine $f^C_{separation}$ and $f^C_{cohesion}$. Thinking in
a two dimensional space of $f_s$ and $f_c$, a natural approach is to get as far from
the point (0, 0) as possible. This implies

    $f^C = \sqrt{(f_s^C)^2 + (f_c^C)^2}$,    (4)

so the fitness is the Euclidean distance from (0, 0). Again, this is just one possibility;
better combinations may exist. E.g. the relative weight of $f_s$ and $f_c$ may
be adjusted in a more well-grounded way. However, eq. (4) is able to pass the test
raised by Fig. 1: for Fig. 1a, $\lambda_2^{2cliques} = 0.258$, $f_c^{2cliques} = 0.228$, $f^{2cliques} = 0.995$,
while for a single clique $\lambda_2^{clique} = 6$, $f_c^{clique} = 1$, $f^{clique} = 1.371$. For Fig. 1b,
$\lambda_2^{12nodes} = 3.268$, $f_c^{12nodes} = 0.738$, $f^{12nodes} = 1.218$, and for the best subgraph,
a triangle, $\lambda_2^{triangle} = 3$ and $f_c^{triangle} = 1$, with a total fitness $f^{triangle}$ below $f^{12nodes}$.
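The composite fitness can be sketched as the Euclidean norm of the two terms. In the example below, the degree sums (K_in = 62, K_out = 2 for the 12-node subgraphs, and K_in = 30, K_out = 2 for a single 6-clique) are assumptions reconstructed for illustration; they are not stated in the text, but they are consistent with the fitness values 0.995, 1.371 and 1.218 quoted for Fig. 1. The function names are hypothetical.

```python
import math


def separation_fitness(k_in, k_out):
    """Separation term K_in / (K_in + K_out)."""
    return k_in / (k_in + k_out)


def composite_fitness(f_s, f_c):
    """Euclidean distance of the point (f_s, f_c) from the origin."""
    return math.hypot(f_s, f_c)


# Assumed degree sums (see the lead-in); cohesion values as quoted in the text.
f_two_cliques = composite_fitness(separation_fitness(62, 2), 0.228)
f_single_clique = composite_fitness(separation_fitness(30, 2), 1.0)
f_twelve_nodes = composite_fitness(separation_fitness(62, 2), 0.738)

print(round(f_two_cliques, 3), round(f_single_clique, 3), round(f_twelve_nodes, 3))
```

With these numbers the single clique scores higher than the pair of cliques, while the cohesive 12-node subgraph scores higher than the loosely joined pair, which is the behaviour the test of Fig. 1 asks for.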

Beyond enabling one to decide whether a given subgraph is a community
or not (by requiring local optimality), the above definition makes it possible to
assess how good a community it is. This is also possible with other definitions,
e.g. by using the modularity function, but here, communities are placed in a
2-dimensional space instead of 1 dimension. This gives rise to an interesting
possibility for characterizing the communities, like "very cohesive but densely
connected outwards" or "well-separated but poorly interconnected". Considering
Fig. 1a, one may think that the latter is not really a community. But for
large subgraphs, it may make sense to consider a well-separated subgraph as a
community, as common sense says that large communities should be looser than
small ones.

4.3 Community detection in reality 

In this section, the details of the practical implementation of the new method are
discussed. Most importantly, in order to actually find the communities, a heuristic
carrying out the optimization of eq. (4) is needed. Furthermore, there is a second
problem of detecting communities hierarchically embedded into each other.
These two questions will be answered by a common solution.

The heuristic is based on the one of the LFK method [39]. Among its details,
the LFK heuristic contains a tunable parameter (denoted as $\alpha$), which is claimed
to be able to recover communities at different hierarchical levels. Lowering this
parameter $\alpha$ results in increased community sizes. Hierarchical levels are supposed
to be stable against the variation of $\alpha$, so there should be long intervals
of $\alpha$ for which the communities do not change. However, large graphs may lack
long stable intervals, as some changes occur around any parameter value (data
not shown). Therefore, a new method for investigating hierarchical structure
is needed. I dropped the idea of using threshold values of $\alpha$, corresponding to
community structures at different scales, which should be simultaneously valid
for all communities, and I will treat each community separately.

Similarly to [39], each community is grown from a seed node. It is important
to note that each seed node can result in a series of (successively larger)
communities. Growth consists of successively including the neighboring node which
increases the fitness defined by eq. (4) the most. When there is no neighboring node
whose inclusion can improve the fitness, the stage of node removal begins. Here,
the algorithm tries to improve the fitness of the cluster by excluding nodes from it
(with the exception of the seed node, which is not permitted to be excluded). It
finishes when no further removal can improve the fitness. Then, growth begins
again, if possible. The grow-shrink cycle is iterated as long as the fitness can
be improved. When no improvement is possible (there is a local optimum of
the fitness), the actual list of nodes is registered as a valid community.

After that, the algorithm tries to find a larger community which contains the current
one. This way, hierarchical structures can be revealed. In order to do this, the
growing cluster should first escape from the basin of attraction of the current
local optimum. Therefore, the cluster is forced to grow by successively including
the neighboring nodes which decrease the fitness the least. After some steps of
forced growth, when increasing the fitness becomes possible again, the algorithm
turns back to the normal grow-shrink procedure, until a new local optimum is
found, signaling a new community. The cluster keeps hopping from one local
optimum to another until it grows so large that it contains the whole
graph. Then a new growth process starts from a new seed node. At the end
of its growth process, it includes the whole graph again, unless it encounters a
local optimum which has already been found, i.e. the corresponding community
has already been registered. In this case, the growth process is stopped. Then,
another growth process starts from a not-yet-used seed node. In contrast to [55],
all nodes in the graph are used as seed nodes, in order not to miss good communities.
When the growth process beginning from the last seed node finishes,
the algorithm ends, and the registered communities are written to the output.

There are a few additional tricks. First, if escaping from a local optimum seems
to be hard, i.e. after changing from forced growth to the normal grow-shrink
stage we still end up in the previous local optimum, the cluster is restored to
the state where it had its maximal size (the beginning of one of the removal sessions),
then 2 steps of forced growth are applied before the normal grow-shrink
cycle begins. A second trick is that when judging the identity of two communities,
they are considered identical if at least 80% of the larger community is
a subset of the smaller one. In case of identity, the community which has the
higher fitness is kept in the registry.
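The basic grow-shrink cycle and the 80% identity rule can be sketched as follows. This is a simplified illustration under stated assumptions, not the released software: it covers only the search for one local optimum from one seed (no forced growth, no hopping between optima, no restore trick), and it uses the separation term K_in / (K_in + K_out) as a stand-in fitness so that the example needs no eigenvalue solver. All function names and the example graph are hypothetical.

```python
def grow_shrink(adj, seed, fitness):
    """Basic grow-shrink cycle: returns a locally optimal node set containing seed."""
    cluster = {seed}
    current = fitness(cluster)
    improved = True
    while improved:
        improved = False
        # Growth: repeatedly add the neighbouring node that increases the fitness most.
        while True:
            frontier = {v for u in cluster for v in adj[u]} - cluster
            best_node, best_value = None, current
            for node in sorted(frontier):
                value = fitness(cluster | {node})
                if value > best_value:
                    best_node, best_value = node, value
            if best_node is None:
                break
            cluster.add(best_node)
            current = best_value
            improved = True
        # Removal: repeatedly drop the non-seed node whose exclusion helps most.
        while True:
            best_node, best_value = None, current
            for node in sorted(cluster - {seed}):
                value = fitness(cluster - {node})
                if value > best_value:
                    best_node, best_value = node, value
            if best_node is None:
                break
            cluster.remove(best_node)
            current = best_value
            improved = True
    return cluster


def same_community(a, b, threshold=0.8):
    """True if at least `threshold` of the larger set lies in the smaller one."""
    larger, smaller = (a, b) if len(a) >= len(b) else (b, a)
    return len(larger & smaller) >= threshold * len(larger)


# Illustrative graph: two 4-cliques joined by the single edge 3-4.
adj = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
    4: {3, 5, 6, 7}, 5: {4, 6, 7}, 6: {4, 5, 7}, 7: {4, 5, 6},
}


def separation_only(cluster):
    """Stand-in fitness K_in / (K_in + K_out); eq. (4) would also need the cohesion term."""
    k_in = sum(1 for u in cluster for v in adj[u] if v in cluster)
    k_out = sum(1 for u in cluster for v in adj[u] if v not in cluster)
    return k_in / (k_in + k_out) if k_in + k_out else 0.0


print(grow_shrink(adj, 0, separation_only))
print(same_community({1, 2, 3, 4, 5}, {1, 2, 3, 4}))
```

Because every accepted move strictly increases the fitness and the number of node sets is finite, the cycle terminates. Measuring overlap against the larger set keeps a small community and its much larger superset distinct, which is what allows hierarchical levels to coexist in the registry.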

The algorithm, although based on the one of [39], differs in several points: from
one seed, several communities can be reached instead of only the smallest one;
node removal occurs when node addition is not possible instead of after each
addition (this trick also speeds up the algorithm); the seed node is not permitted to
be removed; all nodes are used as seeds instead of the not-yet-covered nodes. An
algorithm similar in spirit was described in [SD] . The results in the next section 
are obtained using this method, unless stated otherwise explicitly. The software 
realizing the algorithm is available at 
http://www.phy.bme.hu/~tibelyg/.

5 Test results 

Probably the most frequently used test is Zachary's karate club friendship
network [61]. Due to a dispute between two prominent persons (nodes 1 and 34),
the club split into two during sociological observation, and the memberships
in the new clubs are known. As the split occurred more or less along a border
of two visible communities, new community detection algorithms are usually
claimed to pass the test if they reproduce the split. However, the aim is the
detection of topological modules, not functional ones, so the result of the sociological
study is not a strict criterion for judging the output of any community
detection method. E.g., node 10 has one link to each of the new clubs, so
"misplacing" it (compared to the split) may not be considered a fault. Or
node 12, which attaches only to node 1, can hardly be considered part of a
"densely interconnected" cluster.

The algorithm finds 33 groups, including several non-relevant ones, like pairs
of nodes. Therefore, a filtering procedure is required. The statistical significance
of the resulting communities [53], [31] is utilized for this purpose. The statistical
significance can be sensitive to missing nodes [29], therefore each cluster
is allowed to be completed with the neighboring node which optimizes the statistical
significance. Then the clusters are ordered according to their statistical
significance. The first 3 clusters provide a single-level community structure, corresponding
to 3 known communities, with 2 overlapping nodes and 1 homeless node
(Fig. 3, left panel). Taking a look at the subsequent clusters provides information
about the multi-scale structures in the graph. The next few clusters reveal
cluster cores and hierarchical decomposition of the network (Fig. 3, right panel).
The statistical significance score is quite capable of distinguishing meaningful
structures; there is a gap between 0.42 and 0.81, so setting a threshold of 0.5
selects the multi-scale clusters which would be approved by a human investigator.
There is only one exception: the almost-full-clique of nodes {1, 2, 3, 4, 8,
14} has significance 0.81, which is probably the consequence of neglecting the
internal cohesion by the current form of statistical significance.

The currently most advanced class of benchmarks was introduced by [25]. In
these so-called LFR benchmarks, the network size and edge density are freely 
adjustable, and more importantly, the node degrees and the community sizes 
are distributed according to power-law distributions, with tunable exponents. 

*If the criterion were based on some percent of the smaller group, subset-superset pairs 
would be considered identical. 

Figure 3: (Color online) The 3 (left) and 10 (right) best found communities of
the Zachary karate club. On the right, thicknesses of lines indicate the ordering 
of the statistical significance values (running from 0.002 to 0.42, plus 0.81 for 
the dashed line-bordered community). Note that node 12 is contained only by 
large communities. 

Figure 4: (Color online) Positions of the Zachary communities on the fs-fc
plane. Small groups tend to cluster in the north, and large groups in the east.

Communities are defined through a prescribed ratio of inter-community links
for each node (mixing ratio, $\mu$), similarly to the preceding GN benchmark class
[17]. Generalizations for weighted and directed networks, and for overlapping
communities, also exist [26].

A wide-scale comparison of different community detection methods using the
LFR benchmark was done by [62]. For ease of comparison, the parameter
values of [62] are applied here: the networks consist of 1000 nodes, the average
degree is 20, the maximal degree is 50, the exponent of the degree distribution
is -2 and the exponent of the community size distribution is -1. There are two
types of networks: for the S type the community sizes are between 10 and 50
("small") and for the B type they are between 20 and 100 ("big"). In [62],
networks of 5000 nodes were also investigated. Due to the large computational
time, they are omitted here.* Also for computational time considerations, the
detecting algorithm stopped growing the communities over a predefined size,
120 for the S case and 220 for the B case. All measurement values are obtained
from runs on 10 different networks.

The similarity of the built-in and the obtained community structures is quantified
by a variant of the normalized mutual information (NMI) which is able to
handle overlapping communities [39]. This is the similarity measure applied by
[62].**

Selecting the most relevant communities from the abundant output was done
similarly to the previous case. The clusters were completed by 1 neighboring
node, if that improved the statistical significance, and sorted with respect to
the statistical significance scores. The clusters containing at least 1 uncovered
node were accepted one by one until all nodes were covered.

To see the potential of the new method, and to check the effect of the output
filtering, the communities corresponding best to the built-in original ones were
also selected from the algorithm's output. The results are plotted in Fig. 5(a).
The filtered results are similar to those of the lower performing algorithms
in [62], while optimal selection provides much better scores, although still not
as good as the best methods. The large difference between the optimal and the
statistical significance-based results is quite surprising, especially in light of
the fact that statistical significance in itself is able to provide excellent results
on the LFR benchmark [53].

The algorithm was also tested on networks with overlapping communities. In 
this case, clusters having significance score below 0.1 were accepted, similarly 
to [M]. Fig. 5(b) shows that the effect of the imperfect output filtering is again
very large; an ideal selection scheme would allow very good results. This is not
surprising, as other algorithms based on the one of [39] also give excellent results
on overlapping communities [63].

Finally, the new method was applied to a word association graph built from 
the University of South Florida Free Association Norms [53]. Here, nodes are 
words and edges show that some people associated the corresponding two words. 

*It does not mean that a single 5000-sized graph is too large; however, a few hundred of
them are.

**Both the LFR benchmark and the generalized normalized mutual
information are freely available from the authors' websites,
http://sites.google.com/site/santofortunato/inthepress2 and
http://sites.google.com/site/andrealancichinetti/software

Figure 5: (Color online) Results on the LFR benchmark. Panel (a) corresponds
to unweighted, undirected and non-overlapping tests, while panel (b)
corresponds to overlapping tests. Overlapping tests were done at two different values
of the mixing parameter, $\mu = 0.1, 0.3$. For both panels: full symbols and lines
correspond to the applied filtering, empty symbols with dotted lines correspond
to perfect output filtering.



The network has 5018 nodes with mean degree $\langle k \rangle = 22.0$. It is a frequently used
example of overlapping community structure [34], [21]. Although edge weights
are accessible, the algorithm was applied to the unweighted version of the
network. As an illustration, low-level communities around the word bright are
plotted in Fig. 6. An interesting effect is the appearance of overlapping edges, due
to the heavy overlap in the network.

In conclusion, although selecting the relevant communities from the output
is not yet a solved task, the algorithm gives good results on the Zachary
karate club, and performs reasonably on the LFR benchmarks. It should be
noted, however, that due to the internal cohesion criterion, this algorithm's
output is not intended to perfectly match benchmarks like GN and LFR, which
define communities solely on the basis of external separation. An additional
observation is reported here: on GN benchmark graphs* with nodes having
exactly the prescribed in- and out-degrees, at large mixing ratios communities
deviating from the built-in ones but having better-than-designed mixing ratios
were found. Note that the new method does not optimize just for external
separation, so even better "spontaneous" communities may exist. This phenomenon,
although not a huge surprise, raises the question of how to judge precisely
a community detection method's output at large mixing ratios, as the known
community structure may not be fully trustworthy.



6 Discussion & Conclusions

*Results are omitted, as the presented LFR benchmark is a generalization of the GN.

Figure 6: (Color online) Communities around bright, on the first hierarchical
level. Color denotes communities. Gray shows overlapping nodes and edges.
Black edges are between different communities.

An important aspect of all community detection methods is the running time. In
the case of the new method described above, the time requirement is as follows.
Starting a new community from each node contributes a factor of $N$ to the CPU
time. Evaluating the eigenvalues of a community $C$ plus one extra node takes
$\frac{2}{3}(|C|+1)^3$ operations. Assuming that $C$ has
$\mathrm{const} \cdot \langle k \rangle \cdot |C|$ neighboring nodes (i.e., on
average, each node has a constant fraction of its neighbors outside $C$), the running
time can be estimated as

$$T \sim N \cdot \sum_{|C|=1}^{|C|_{\max}} \mathrm{const} \cdot \langle k \rangle \cdot |C| \cdot (|C|+1)^3 \approx N \cdot \mathrm{const}' \cdot \langle k \rangle \cdot |C|_{\max}^5 \qquad (5)$$
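
The last step of eq. (5) replaces the sum by its leading power. As a quick numerical sanity check (a sketch only; the helper name `growth_cost` is hypothetical and all constants are dropped), the ratio of the sum to $|C|_{\max}^5$ can be seen to settle near 1/5:

```python
def growth_cost(c_max):
    """Sum of |C| * (|C| + 1)**3 for |C| = 1 .. c_max (constant factors dropped)."""
    return sum(c * (c + 1) ** 3 for c in range(1, c_max + 1))


# Ratio of the exact sum to the leading power c_max**5 for growing c_max.
ratios = {c_max: growth_cost(c_max) / c_max ** 5 for c_max in (10, 100, 1000)}
for c_max, ratio in ratios.items():
    print(c_max, round(ratio, 4))
```

The ratio decreases towards 0.2 as the maximal community size grows, so the fifth-power scaling in eq. (5) only absorbs a constant factor.
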

A naive estimate for |C|max would be N. However, as more and more commu- 
nity growing processes finish, the newly started communities are expected to 
terminate in a previously discovered community earlier and earlier, on average. 
Of course, some communities will reach |C| = N. Therefore, 

$$T \propto N^{5+\delta}, \qquad \delta \in [0,1] \qquad (6)$$

which is huge and clearly denies the analysis of even medium-sized graphs 
(0(10^) nodes) without further improvements. Note that graphs of thousands of 
nodes may be manageable, like the word association graph shown above, which 
took 56 hours on a single CPU. One possibility is to choose the initial seed more 
intelligently, starting communities from promising seeds. [63] achieved good
results in this respect. An intelligent seed selection is also important if the number
of communities in a cover is larger than N, or if some communities have only
overlapping nodes - in this case, it may happen that all growth processes miss 
a certain community. 

Another important question is the applicability of an advanced eigenvalue solver.
Arpack++ [SS] and SLEPc [5^ were tried. The experience was that - despite 
their good asymptotic performance in the large matrix limit - for the many
small subgraphs occurring here the overhead of these complex packages was so
large that the final running time was much higher than that obtained with
the QR-decomposition algorithm.

Doing optimization in a multi-parameter space is a nontrivial task, because
different parameters can lie in different ranges. Therefore, an important direction
for future research is to investigate the best combination of the parameters in
the fitness function, based on the evaluation of empirical data.

Finally, filtering the relevant communities from the found ones is also a
challenging task. The natural approach is to apply statistical significance, which should
be applied even if filtering were not needed. However, deciding the threshold
significance value is not necessarily trivial in all cases. Furthermore, the current
form of statistical significance accounts only for the separation of the
community, not for its internal cohesion. This manifests itself e.g. in the poor score of the
almost-full-clique subgraph in the Zachary karate club (the dark purple group
in Fig. 3). As the main advantage of the fitness function of eq. 4 is the inclusion
of cohesion, it would be important to develop a statistical significance measure taking it
into account.

Conclusions. The community detection problem currently suffers from two
fundamental deficiencies. First, there is no definition of community which is
precise enough to allow constructing community finding methods. Second,
thoroughly testing a proposed algorithm is problematic, which is not independent of the
first difficulty. I attempted to improve on both issues.

In this paper, I proposed a formal list of required properties for locally dense 
subgraphs, taking a step towards an applicable definition of the term "commu- 
nity". Two properties, external separation and internal cohesion ("shape") were 
named. External separation has already been applied by some of the community 
detection methods, and also by benchmarks. Internal cohesion was not consid- 
ered explicitly earlier. No current method was found which satisfactorily applies 
both criterions. I demonstrated on simple examples that both properties are 
necessary; discarding either of them leads to counterintuitive results. Beyond
allowing the construction of new methods, these two criterions can also be used as a
basis for testing existing ones. They also allow the characterization of a
community by two independent quantities, instead of a single scalar.

I proposed a new composite fitness function which takes the two criterions
into account. For the quantification of the internal cohesion, the second
eigenvalue of the Laplacian matrix is applied, which provides appropriate results
on characteristic graphs like cliques or chains. I also proposed a heuristic, by
redesigning the LFK heuristic [39], which can find overlapping locally dense
subgraphs of all scales, producing much less output than multiresolution methods
but with fewer restrictions than imposed by assuming a hierarchical structure.
Runs on the Zachary network and LFR benchmarks showed that the method is
able to provide the expected results. Overlapping communities can be detected
especially efficiently, similarly to other LFK-based heuristics [63]. However,
significant improvements are yet to be implemented, e.g. reducing the running
time, finding a more effective filtering procedure for the output, or fine-tuning
the relative weight of the separation and the cohesion terms in the fitness
function.
Appendix



A Second eigenvalue of two weakly connected 
cliques 

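Assume two cliques of $n$ nodes each, with all edge weights equal to 1. The two cliques
are attached by a single edge of weight $\varepsilon$ between nodes $n$ and $n+1$. Then
the eigenvalue equations for the Laplacian matrix (taken here with the sign convention
$A - D$, so that its eigenvalues are non-positive) are

$$\sum_{j<n,\, j \neq i} x_j + x_n - (n-1+\lambda)\, x_i = 0 \qquad \forall i < n \qquad (7)$$

$$\sum_{k>n+1,\, k \neq i} x_k + x_{n+1} - (n-1+\lambda)\, x_i = 0 \qquad \forall i > n+1 \qquad (8)$$

$$\sum_{j<n} x_j + \varepsilon \cdot x_{n+1} - (n-1+\varepsilon+\lambda)\, x_i = 0 \qquad i = n \qquad (9)$$

$$\sum_{k>n+1} x_k + \varepsilon \cdot x_n - (n-1+\varepsilon+\lambda)\, x_i = 0 \qquad i = n+1 \qquad (10)$$

Adding the last two equations gives

$$\sum_j x_j - x_n - x_{n+1} + \varepsilon x_n + \varepsilon x_{n+1} - (n-1+\varepsilon+\lambda)\, x_n - (n-1+\varepsilon+\lambda)\, x_{n+1} = 0 \qquad (11)$$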

The eigenvector corresponding to the first eigenvalue (which is zero) is the
constant vector; therefore, for all other eigenvectors the sum of the components should
be zero in order to be orthogonal to the first one. Consequently, $\sum_j x_j = 0$.
Applying this and a little algebra results in

$$x_n(\lambda + n) + x_{n+1}(\lambda + n) = 0 \qquad (12)$$

$$x_n = -x_{n+1} \quad \text{if } \lambda \neq -n \qquad (13)$$



If $\lambda = -n$, then eqs. 7 and 8 reduce to $\sum_{j \le n} x_j = 0$ and
$\sum_{k \ge n+1} x_k = 0$. Now consider the eigenspace corresponding to $\lambda = -n$,
and look for eigenvectors such that $x_n = x_{n+1} = c$, $\sum_{j<n} x_j = -c$,
$\sum_{j>n+1} x_j = -c$. In this eigenspace the number of free parameters is
$1 + 2 \cdot (n-2)$, corresponding to $c$ and
$x_1 \dots x_{n-1}, x_{n+2} \dots x_{2n}$ with two constraints. Altogether, the dimension of the
eigenspace (the multiplicity of $\lambda = -n$) is $2n - 3$. Adding the $\lambda = 0$ case, we
are left with at most two unknown eigenvalues.

For $\lambda \neq -n$, we look for the solutions in the form
$(a, \dots, a, b, -b, -a, \dots, -a)^T$. Then the eigenvalue equations are

$$(n-2)a + b - (n-1+\lambda)a = 0 \qquad (14)$$

$$(n-1)a - \varepsilon \cdot b - (n-1+\varepsilon+\lambda)b = 0 \qquad (15)$$

After simplifications,

$$-(1+\lambda)a + b = 0 \qquad (16)$$

$$(n-1)a - (n-1+2\varepsilon+\lambda)b = 0 \qquad (17)$$
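Expressing $\lambda$ from these equations reads

$$\lambda = \frac{b}{a} - 1, \qquad \lambda = -(n-1+2\varepsilon) + (n-1)\frac{a}{b} \qquad (18)$$

Writing $\lambda = \lambda$ results in

$$-n+1-2\varepsilon + (n-1)\frac{a}{b} = \frac{b}{a} - 1$$

$$(-n+1-2\varepsilon)\frac{b}{a} + (n-1) = \left(\frac{b}{a}\right)^2 - \frac{b}{a}$$

Introducing $x = b/a$ gives

$$-x^2 + (-n+2-2\varepsilon)x + (n-1) = 0$$

$$x_{1,2} = \frac{-(n-2+2\varepsilon) \pm \sqrt{(n-2+2\varepsilon)^2 + 4(n-1)}}{2}$$

The term under the radical symbol can be approximated using the leading
terms of the Taylor series $\sqrt{1-x} \approx 1 - x/2$:

$$\sqrt{(n-2+2\varepsilon)^2 + 4(n-1)} = \sqrt{(n+2\varepsilon)^2\left(1 - \frac{8\varepsilon}{(n+2\varepsilon)^2}\right)} \approx (n+2\varepsilon)\left(1 - \frac{4\varepsilon}{(n+2\varepsilon)^2}\right)$$

which gives

$$x_1 \approx 1 - \frac{2\varepsilon}{n+2\varepsilon}$$

$$x_2 \approx 1 - (n+2\varepsilon) + \frac{2\varepsilon}{n+2\varepsilon}$$

which, using eq. 18, leads to

$$\lambda_1 \approx -\frac{2\varepsilon}{n+2\varepsilon}$$

$$\lambda_2 \approx -(n+2\varepsilon) + \frac{2\varepsilon}{n+2\varepsilon}$$

meaning that the last two eigenvalues of the Laplacian are found.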
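
The result can be checked numerically without an eigenvalue solver. The sketch below is illustrative only (the function names and the values $n = 6$, $\varepsilon = 0.1$ are arbitrary choices, not taken from the paper): it builds the matrix $A - D$ of the two weakly connected cliques, inserts the exact roots of the quadratic in $x = b/a$ into the ansatz vector, verifies that this is indeed an eigenvector, and compares the exact eigenvalue $\lambda = x - 1$ with the approximations above.

```python
import math


def negative_laplacian(n, eps):
    """Return A - D for two n-cliques joined by one edge of weight eps.

    Nodes 0..n-1 form the first clique and n..2n-1 the second; the weak edge
    joins nodes n-1 and n (nodes n and n+1 in the 1-based notation of the text).
    """
    size = 2 * n
    m = [[0.0] * size for _ in range(size)]
    for start in (0, n):
        for i in range(start, start + n):
            for j in range(start, start + n):
                if i != j:
                    m[i][j] = 1.0
    m[n - 1][n] = m[n][n - 1] = eps
    for i in range(size):
        m[i][i] = -sum(m[i])  # the diagonal is still 0 here, so this is -degree
    return m


def ansatz_eigenpairs(n, eps):
    """Exact eigenpairs of the form (a,..,a,b,-b,-a,..,-a) with a = 1 and b = x."""
    big_b = n - 2 + 2 * eps
    root = math.sqrt(big_b ** 2 + 4 * (n - 1))
    pairs = []
    for x in ((-big_b + root) / 2, (-big_b - root) / 2):
        vector = [1.0] * (n - 1) + [x, -x] + [-1.0] * (n - 1)
        pairs.append((x - 1, vector))  # lambda = b/a - 1, from eq. (16)
    return pairs


def residual(matrix, lam, vector):
    """Largest absolute component of matrix @ vector - lam * vector."""
    return max(
        abs(sum(row[j] * vector[j] for j in range(len(vector))) - lam * vector[i])
        for i, row in enumerate(matrix)
    )


n, eps = 6, 0.1
matrix = negative_laplacian(n, eps)
(lam1, v1), (lam2, v2) = ansatz_eigenpairs(n, eps)
approx1 = -2 * eps / (n + 2 * eps)
approx2 = -(n + 2 * eps) + 2 * eps / (n + 2 * eps)
print(lam1, approx1, residual(matrix, lam1, v1))
print(lam2, approx2, residual(matrix, lam2, v2))
```

For small $\varepsilon$ the exact and approximate values agree to a few parts in ten thousand, and the small eigenvalue $\lambda_1$ vanishes linearly with the weight of the connecting edge.
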






B Review of current methods 



Here, a one-by-one review of methods follows, from the point of view of the 
separation & cohesion criterions. 

Separation-targeted methods 

Method of Lancichinetti et al. (LFK) [39] - although it is a multiresolution
method, it is informative to take a look at it with the resolution parameter
(see eq. 1) fixed at $\alpha = 1$. Then the fitness function of a community, which is
to be optimized, is simply the sum of in-degrees divided by the sum of degrees
of the community members. Thus, this method is a clear implementation of the
separation criterion. Consequently, it is not sensitive to the internal distribution
of edges (Figs. 1a and 1b get the same fitness). The cohesion criterion is absent,
so one clique in Fig. 1a has lower fitness than the union of the two cliques.
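
The last statement can be illustrated with a small sketch. It assumes that Fig. 1a consists of two 6-cliques joined by one edge, with one link from each clique to the rest of the graph; the node labels, the helper names and the placement of the two outside links are hypothetical.

```python
from itertools import combinations


def lfk_fitness(adjacency, community, alpha=1.0):
    """k_in / (k_in + k_out)**alpha, where k_in is the sum of in-degrees."""
    members = set(community)
    k_in = sum(1 for u in members for v in adjacency[u] if v in members)
    k_out = sum(1 for u in members for v in adjacency[u] if v not in members)
    return k_in / (k_in + k_out) ** alpha


def two_cliques_with_stubs(size=6):
    """Two cliques joined by one edge, each with one link to the outside world."""
    adjacency = {}

    def add_edge(u, v):
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    left = list(range(size))
    right = list(range(size, 2 * size))
    for clique in (left, right):
        for u, v in combinations(clique, 2):
            add_edge(u, v)
    add_edge(left[-1], right[0])   # the single inter-clique edge
    add_edge(left[0], "rest_a")    # assumed links to the rest of the graph
    add_edge(right[-1], "rest_b")
    return adjacency, left, right


adjacency, left, right = two_cliques_with_stubs()
fitness_one_clique = lfk_fitness(adjacency, left)
fitness_union = lfk_fitness(adjacency, left + right)
print(fitness_one_clique, fitness_union)
```

Under these assumptions one clique scores 30/32 while the union scores 62/64, so a purely separation-based fitness prefers the merged group.
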

Labelpropagation [40] - the communities are defined as sets of nodes such
that every node belongs to the community to which the majority of its
neighbors belong. Labelpropagation does not qualify the communities, it just finds
partitions obeying the majority rule. Consequently, Fig. 1a can be judged as
a proper single community, and Fig. 1b can be split by collecting every second
node into the same cluster.

Infomap [18] - Infomap aims to minimize the length of the description of a
random walk, using clusters. The best description length corresponds to the best
trade-off between small cluster sizes (understood in in-degrees) and few links
between clusters. It is straightforward to calculate that for the configuration
in Fig. 1a, Infomap will properly separate the two cliques unless the number of
inter-community links is larger than 6.9-10^. Although this resolution limit looks 
practically unimportant, it shows that Infomap has some conceptual problems. If
3 edges are placed instead of 1 between the 2 cliques in Fig. 1a, Infomap will
merge the two cliques if the number of inter-community edges in the rest of the
network is larger than 149, which is more than 5 orders of magnitude smaller
than the previous threshold. Two conclusions should be drawn: Infomap is
quite sensitive to the number of inter-community edges, and, as a consequence,
it can produce counterintuitive communities in realistic graphs.

Clique Percolation Method (CPM) [21] - communities are defined as maximal
sets of adjacent $k$-cliques, $k$ being a parameter. Adjacency holds if $k-1$
nodes are shared by two cliques. Although CPM enforces a very strong cohesion
locally, it applies only to $O(1)$-sized subgraphs of communities. Consequently,
there are no cohesion requirements on the scale of the whole community. E.g.,
the cliques of a cluster might form a chain, and the method gives no information
about the shape of the cluster. Considering Fig. 1a, it is trivial to modify it such
that CPM merges the two large cliques into a single cluster, e.g. using 3-cliques.
Furthermore, the absence of a single percolating series of neighboring cliques
means that a subgraph will not appear as a single community, regardless of its
other parameters (see e.g. Fig. 1b applying 4-cliques). Finally, CPM uses the
same clique size for the whole network, regardless of local variations in edge
density.
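
The remark about chains can be made concrete with a brute-force sketch of clique percolation, meant for tiny graphs only. It is not the implementation of [21]; the function names and the test graph (a strip of triangles in which node i is linked to i+1 and i+2) are hypothetical.

```python
from itertools import combinations


def clique_percolation(adjacency, k):
    """Communities as unions of k-cliques chained through k-1 shared nodes."""
    nodes = sorted(adjacency)
    cliques = [
        frozenset(c)
        for c in combinations(nodes, k)
        if all(v in adjacency[u] for u, v in combinations(c, 2))
    ]
    parent = list(range(len(cliques)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:  # adjacent k-cliques
            parent[find(i)] = find(j)

    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(group) for group in groups.values())


def triangle_strip(length):
    """Chain-like graph: node i is linked to nodes i+1 and i+2."""
    adjacency = {i: set() for i in range(length)}
    for i in range(length):
        for j in (i + 1, i + 2):
            if j < length:
                adjacency[i].add(j)
                adjacency[j].add(i)
    return adjacency


strip = triangle_strip(10)
communities = clique_percolation(strip, k=3)
print(communities)
```

With k = 3 the whole strip is returned as one community although it is a long chain, while with k = 4 nothing is found at all, which illustrates both the missing global cohesion requirement and the dependence on a single network-wide clique size.
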

Method of Radicchi et al. [20] - there are two possible criterions for
communities to choose from: either all community members (strong definition) or
only the community as a whole (weak definition) should have more links inside than
outside. Proper communities are found
by iteratively bisecting the network, until no bisection can be carried out
without violating the criterion used. So, the effective definition is that a community
is a subgraph obeying one of the criterions mentioned above such that no
bisection of it can result in proper communities. Fig. 1b with a minor tweak would be
split even using the strong definition, assigning every second node to the same
community. The tweak is to place the 2 outside links on the $k_{in} = 6$ nodes.
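
The two criterions can be written down in a few lines. The sketch below is an illustration under the usual reading of the strong and weak definitions (every member, respectively the community as a whole, has more links inside than outside); the helper names and the toy graph are hypothetical.

```python
def in_out_degrees(adjacency, community):
    """Map each member to its (links inside, links outside) pair."""
    members = set(community)
    return {
        u: (
            sum(v in members for v in adjacency[u]),
            sum(v not in members for v in adjacency[u]),
        )
        for u in members
    }


def is_strong_community(adjacency, community):
    """Strong definition: every member has more links inside than outside."""
    degrees = in_out_degrees(adjacency, community)
    return all(k_in > k_out for k_in, k_out in degrees.values())


def is_weak_community(adjacency, community):
    """Weak definition: the sum of in-degrees exceeds the sum of out-degrees."""
    degrees = list(in_out_degrees(adjacency, community).values())
    return sum(k_in for k_in, _ in degrees) > sum(k_out for _, k_out in degrees)


# Toy graph: a triangle {0, 1, 2} whose node 0 has three links to the outside.
toy = {0: {1, 2, 3, 4, 5}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}, 5: {0}}
triangle = [0, 1, 2]
print(is_strong_community(toy, triangle), is_weak_community(toy, triangle))
```

Both tests look only at in- and out-degrees, so neither of them says anything about how the internal links are arranged.
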

Method of Estrada and Hatano [41] - as it relies on the eigenvalues and
eigenvectors of the whole graph, it is a global method. Therefore, whether a set
of nodes is judged to be a cluster or not depends also on the rest of the graph.
Unfortunately, the behavior of the eigenvalues and eigenvectors of the adjacency
matrix of a graph is not well understood. Consequently, empirical tests were
conducted. If the method is run on only the 12 nodes of Fig. 1, configuration
a) is cut into the two proper sets, but configuration b) is cut into several small
(overlapping) clusters, such that each triangle forms one. When the 12 nodes are
attached to a 100-node ring, in which the first and second neighbors on both sides
of a node are attached to the node (degrees are 4), then for configuration a) the
two clusters expand to the first neighboring nodes in the ring, and
configuration b) has the same clusters as in the fully separated case. So, if the rest of
the graph is not denser than the set of nodes under investigation, it seems that
internal cohesion does matter, whereas external separation does not. If the 100-node
ring is two degrees denser (the first 3 neighbors are attached, degrees are 6), the 12
nodes coalesce into 1 cluster for both configurations a) and b), incorporating a
few nearby nodes from the large ring (8 for a) and 6 for b)). For even denser
100-node rings, the 12 nodes become part of a large cluster containing many
nodes from the large ring. So, in conclusion, the global character of the method
makes its behavior on the configurations of Figs. 1a and 1b indefinite.

Stochastic blockmodels and spin-based methods 

Modularity optimization [17] - for each community, modularity counts the
inside links and their expected values, based on the degrees of the nodes. Due to
the well-known resolution limit problem [35], [36], the optimal-modularity partition
merges the two cliques of Fig. 1a for sufficiently large graphs.
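
The size at which the merge becomes favorable follows from the standard form of modularity, $Q = \sum_c [l_c/L - (d_c/2L)^2]$, where $L$ is the total number of links: merging two communities with total degrees $d_1$, $d_2$ and $l_{12}$ links between them changes $Q$ by $l_{12}/L - d_1 d_2/(2L^2)$. The sketch below evaluates this for two 6-cliques joined by one edge; the total degree of 32 per clique (30 internal link ends, the bridge and one assumed outside link) and the function names are assumptions made for illustration.

```python
from fractions import Fraction


def merge_gain(l_between, d1, d2, total_edges):
    """Change of modularity when two communities are merged into one."""
    total = Fraction(total_edges)
    return Fraction(l_between) / total - Fraction(d1 * d2) / (2 * total * total)


def merge_threshold(l_between, d1, d2):
    """Number of links in the whole graph above which merging increases Q."""
    return Fraction(d1 * d2, 2 * l_between)


threshold = merge_threshold(l_between=1, d1=32, d2=32)
print(threshold, merge_gain(1, 32, 32, 500) > 0, merge_gain(1, 32, 32, 600) > 0)
```

With these numbers the two cliques are merged as soon as the graph has more than 512 links, regardless of how clearly they are separated locally.
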

Laplacian spectral algorithm by Donetti and Muñoz [42] - although the
method produces candidate partitions using the spectrum of the Laplacian ma- 
trix, the partitions are evaluated using modularity. Consequently, it is equivalent 
to modularity optimization using a special heuristic, implying all the drawbacks 
of modularity. 

Link partitioning method of Evans and Lambiotte [53] - partitioning is
done on the so-called line graph, whose nodes correspond to the edges of the
original graph, and in which links are drawn between edges sharing a node in the
original graph. Variants of the modularity function are proposed as goal functions
for the partition. Different variants use different weighting schemes of the edges,
including the addition of self-loops. As these goal functions are still based on
counting intra-community edges and subtracting some expected value, the
resolution limit problem should appear for large enough graphs.

Method of Ronhovde and Nussinov (RN) [43] - it proposes the Hamiltonian

$$\mathcal{H}(\{\sigma\}) = -\frac{1}{2}\sum_{i \neq j}\left(a_{ij}A_{ij} - \gamma\, b_{ij}(1 - A_{ij})\right)\delta(\sigma_i, \sigma_j),$$

$A$ being the adjacency matrix, $a_{ij}$ and $b_{ij}$ being edge weights. The
configuration corresponding to the minimal Hamiltonian is used as the solution.

The Hamiltonian optimizes simply for the edge densities inside clusters
(distorted by the resolution parameter $\gamma$), which tend to be the largest for cliques.
Consequently, Fig. 1b is worth splitting into 4 if $\gamma > 19/35$. Similarly, for
$\gamma < 1/35$, the two cliques of Fig. 1a are merged. The fact that the proper value
of $\gamma$ may vary from cluster to cluster can render the global optimization process
locally unsuccessful.
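
The 1/35 threshold can be reproduced directly. The sketch below assumes that Fig. 1a consists of two 6-cliques joined by a single edge and uses unit weights $a_{ij} = b_{ij} = 1$; the function and variable names are hypothetical.

```python
from itertools import combinations


def rn_energy(nodes, edges, partition, gamma, a=1.0, b=1.0):
    """RN energy with uniform weights: each same-cluster pair contributes
    -a if it is linked and +gamma*b if it is not."""
    edge_set = {frozenset(edge) for edge in edges}
    energy = 0.0
    # Each unordered pair once, which equals (1/2) * sum over ordered pairs i != j.
    for u, v in combinations(nodes, 2):
        if partition[u] == partition[v]:
            if frozenset((u, v)) in edge_set:
                energy -= a
            else:
                energy += gamma * b
    return energy


nodes = list(range(12))
edges = list(combinations(range(6), 2)) + list(combinations(range(6, 12), 2))
edges.append((5, 6))  # the single link between the two cliques

split = {node: 0 if node < 6 else 1 for node in nodes}
merged = {node: 0 for node in nodes}

for gamma in (0.02, 0.04):
    print(gamma, rn_energy(nodes, edges, split, gamma), rn_energy(nodes, edges, merged, gamma))
```

Merging adds one linked pair and 35 unlinked pairs to the same cluster, so the merged state has lower energy exactly when $1 - 35\gamma > 0$.
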

Method of Nepusz et al. [44] - the main goal of the work is to provide a
community detection framework using fuzzy (soft) memberships, in order to
handle overlaps. The proposed realization of the framework, when restricted to
conventional hard memberships (and unweighted networks), is equivalent to the
previous method with $\gamma = 1$, and with a different heuristic.
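
Stochastic blockmodel of Hofman and Wiggins [45] - based on the
assumption that the community structure can be fitted by a blockmodel in which
intra- and inter-cluster nodes are connected with probabilities $\vartheta_c$ and
$\vartheta_d$ respectively, [45] aims to minimize the Hamiltonian (with $N$ the number of nodes)

$$\mathcal{H} = -\sum_{i<j}\left(J_L A_{ij} - J_G\right)\delta(\sigma_i, \sigma_j) - \sum_{\mu} n_\mu \ln\frac{n_\mu}{N} \qquad (30)$$
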
where $\sigma_i$ is the cluster of node $i$, $n_\mu$ is the size of cluster $\mu$,
$J_G = \ln\left[(1-\vartheta_d)/(1-\vartheta_c)\right]$, $J_L = \ln(\vartheta_c/\vartheta_d) + J_G$.
The number and sizes of clusters, the cluster members,
and the probabilities $\vartheta_c$ and $\vartheta_d$ are determined by minimization. In other words,
a community structure should be found which maximizes the edge densities
inside the communities (the $J_L A_{ij} - J_G$ term), with the restrictions that 1) each
node belongs to exactly one community; 2) the expected intra- and inter-cluster
edge densities are both constants. The method of Hastings [15] is a special case of
this method, needing $J_G$ and $J_L$ as input, and discarding the last term in
equation 30. The method of Ronhovde and Nussinov [43] is also a special case with
the same restrictions, i.e. it is equivalent to the Hastings method. One can see
immediately that equation 30 defines a global method, which is realized by the
global $J_G$ and $J_L$ coupling constants. Since the discovery of the resolution limit
of modularity it is known that globality leads to counterintuitive local trade-offs.
The situation is not different here: a simple calculation for Fig. 1a shows that
the two cliques will be merged if
$(1-\vartheta_c)/(1-\vartheta_d) > 2^{-12/35}(\vartheta_d/\vartheta_c)^{1/35}$, which
can be approximated by $1-\vartheta_c > 0.79\,(1-\vartheta_d)$, assuming that
$(\vartheta_d/\vartheta_c)^{1/35} \approx 1$.
As $\vartheta_c$ corresponds to the intra-cluster edge probability, it is a reasonable
criterion. Furthermore, it is similarly simple to show that for Fig. 1b splitting into
four is profitable if
$(1-\vartheta_d)/(1-\vartheta_c) > 4^{12/35}(\vartheta_c/\vartheta_d)^{19/35}$.
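
The "simple calculation" for Fig. 1a can be sketched as follows. Two assumptions are made here and are not taken verbatim from the text: that the Hamiltonian has the form $\mathcal{H} = -\sum_{i<j}(J_L A_{ij} - J_G)\,\delta(\sigma_i,\sigma_j) - \sum_\mu n_\mu \ln(n_\mu/N)$, and that Fig. 1a consists of two 6-cliques joined by a single edge, so that merging creates 36 new intra-cluster pairs of which only 1 is linked.

```latex
% Energy change when the two 6-cliques are merged into one cluster of 12 nodes.
\begin{align*}
\Delta\mathcal{H}_{\mathrm{merge}}
  &= -\left(J_L - 36\,J_G\right) - 12\ln\frac{12}{N} + 2\cdot 6\ln\frac{6}{N} \\
  &= -\ln\frac{\vartheta_c}{\vartheta_d} + 35\,J_G - 12\ln 2 ,
  \qquad\text{using } J_L = \ln\frac{\vartheta_c}{\vartheta_d} + J_G , \\[4pt]
\Delta\mathcal{H}_{\mathrm{merge}} < 0
  &\iff 35\ln\frac{1-\vartheta_d}{1-\vartheta_c}
        < \ln\frac{\vartheta_c}{\vartheta_d} + 12\ln 2 \\
  &\iff \frac{1-\vartheta_c}{1-\vartheta_d}
        > 2^{-12/35}\left(\frac{\vartheta_d}{\vartheta_c}\right)^{1/35},
  \qquad 2^{-12/35} \approx 0.79 .
\end{align*}
```

The same bookkeeping with 19 links among 54 inter-triangle pairs and an entropy change of $12\ln 4$ gives the splitting condition quoted for Fig. 1b.
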

Mixture model of Newman and Leicht [47] - based on probabilistic
modeling, [47] proposed the following log-likelihood to be maximized:

$$\mathcal{L} = \sum_{i,r} q_{ir}\left[\ln \pi_r + \sum_j A_{ij} \ln \theta_{rj}\right] \qquad (31)$$

where $q_{ir}$ is the probability that node $i$ belongs to cluster $r$, $\pi_r$ is the
fraction of nodes in cluster $r$, and $\theta_{rj}$ is the probability that a randomly chosen
link originating in cluster $r$ points to node $j$. Equation 31 is reminiscent of
the Hamiltonian of Hofman and Wiggins, although there are important
differences. Nodes can have memberships in many clusters simultaneously (with the
constraint that the sum of memberships is 1 for any node). Inter-cluster edges
are accounted for, while missing edges never are. The coupling strength between
neighboring nodes is fine-tuned for each node-cluster pair. Considering Fig. 1b
and assuming hard node memberships (i.e. $q_{ir}$ is 0 or 1 for all nodes), it is easy
to show that splitting into 4 is favored over putting all nodes into one cluster.

Mixture model of Wang and Lai [48] - Wang and Lai improved the mixture
model of Newman and Leicht, arriving at the log-probability

$$\mathcal{L} = \sum_{i,r} q_{ir}\left(\ln \pi_r + \sum_j A_{ij} \ln \rho_{rj} + \sum_j (1 - A_{ij}) \ln(1 - \rho_{rj})\right) \qquad (32)$$

where $\rho_{rj}$ is the probability that a node in cluster $r$ has a link to node $j$. Now
$\mathcal{L}$ counts also the missing edges. For a hard clustering ($q_{ir} = 0$ or 1) it is easy
to calculate that Fig. 1b is preferred in 4 pieces over 1.

Likelihood modularity of Bickel and Chen [49] - the proposition is to
maximize

$$Q_{LM} = \frac{1}{2}\sum_{c,d} n_{cd}\left(\frac{O_{cd}}{n_{cd}} \ln\frac{O_{cd}}{n_{cd}} + \left(1 - \frac{O_{cd}}{n_{cd}}\right)\ln\left(1 - \frac{O_{cd}}{n_{cd}}\right)\right) \qquad (33)$$

where $n_{cd} = n_c n_d$ if $c \neq d$, $n_{cc} = n_c(n_c - 1)$, $n_c$ is the size of cluster $c$, and
$O_{cd} = \sum_{i \in c,\, j \in d} A_{ij}$. The expression is maximal if the clusters are
cliques ($O_{cc}/n_{cc} = 1$) which are totally separated ($O_{cd}/n_{cd} = 0$). As $Q_{LM}$ is
symmetric with respect to $O_{cd}/n_{cd}$ and $1 - O_{cd}/n_{cd}$, bipartite structures can also
get high scores, but here the analysis is restricted to the cluster-based optimum. First,
it should be noted that $Q_{LM}$ penalizes clusters in which the edge density deviates
significantly from its maximal value. Then, it is easy to calculate that it is worth
cutting Fig. 1b into 4 clusters.

Stochastic blockmodel of Karrer and Newman [50] - it is similar to the
previous case. The main difference in the function to be maximized (compared
to equation 33) is the absence of the second logarithmic term representing the
missing links, and the application of sums of degrees instead of cluster sizes.
Similarly, a simple calculation shows that Fig. 1b gets a higher score when split
into four.



Other single-scale methods 

Infomod [51] - the aim is to compress the description of the graph, while
retaining as much information as possible. The description length is given by

$$L = n \log_2 m + \frac{m(m+1)}{2} \log_2 l + \log_2\left[\prod_{i=1}^{m}\binom{n_i(n_i-1)/2}{l_{ii}} \prod_{i>j}\binom{n_i n_j}{l_{ij}}\right] \qquad (34)$$

where $n$ is the number of nodes, $l$ is the number of edges, $m$ is the number
of clusters, $n_i$ is the size of cluster $i$, and $l_{ij}$ is the number of edges between
clusters $i$ and $j$. As can be seen, it is a global method, where trade-offs for a global
improvement may spoil local structures. And indeed, a straightforward
calculation shows that for all but very small graphs Fig. 1a is preferred as a single
community (e.g. $l > 128$ and $m > 7$).

Method of Chauhan et al. [52] - the idea is interesting, i.e. to maximize
the sum of logarithms of the largest eigenvalues of the adjacency matrices of the
individual communities. However, the behavior of the largest eigenvalue of the
adjacency matrix is poorly understood. As a counterexample, given a clique of
size $n$, its largest eigenvalue is $n - 1$, while when it is cut into two, the product of
the first eigenvalues of the two $n/2$-sized cliques is $(n/2 - 1)^2$, which is larger
than $n$ if $n > 8$ - so it is worth cutting a clique into pieces. This is the consequence
of using a concave function (log) in the summation, so it can be easily fixed.
However, for Fig. 1b the largest eigenvalue is 5.2, while the largest eigenvalue
of a 3-clique is 2. Summing the largest eigenvalues for the two cases (instead of
summing their logarithms) results in 5.2 < 8, so it is worth splitting Fig. 1b into 4.
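
The clique counterexample only needs the fact that the largest adjacency eigenvalue of a k-clique is k - 1. The following sketch (hypothetical helper names; the links between the two halves are simply discarded by the split) finds the smallest even clique size for which cutting the clique in two raises the log-based score.

```python
import math


def log_eigen_score(clique_sizes):
    """Sum of log(largest adjacency eigenvalue) over cliques; a k-clique has k - 1."""
    return sum(math.log(size - 1) for size in clique_sizes)


def splitting_pays(n):
    """True if two n/2-cliques score higher than the intact n-clique (n even, n >= 4)."""
    return log_eigen_score([n // 2, n // 2]) > log_eigen_score([n])


first_even_n = next(n for n in range(4, 100, 2) if splitting_pays(n))
print(first_even_n, [n for n in range(4, 21, 2) if splitting_pays(n)])
```

In this sketch the split already wins at n = 8, where $(n/2 - 1)^2 = 9$ exceeds $n - 1 = 7$, and it keeps winning for all larger even sizes, in line with the bound quoted above.
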

Link partitioning method of Ahn et al. [54] - although the described
method applies a hierarchical clustering, using an objective function (edge den- 
sity of clusters) results in a single set of communities. The objective function 
averages the densities of all clusters, consequently it is a global quantity; its 
maximum does not guarantee that each cluster is optimal, just the average - 
nothing prevents the over- or underpartitioning of individual clusters at the 
global optimum. 

Community landscape method of Kovács et al. (ModuLand) [55] - see
also at the hierarchical methods. The interesting idea is to give a scalar value to
the edges indicating how strongly an edge belongs to communities, then to
identify the local maxima and their surroundings ("hills") as the communities. The
scalar value for the edges is obtained as the number of appearances of the edges
in some auxiliary clusters. From each node (or edge), an auxiliary cluster is
grown until its fitness value cannot be increased. Fitness is chosen simply as
the average in-degree of the nodes in the growing cluster. After all auxiliary
communities are determined this way, each edge is assigned a value equaling the
number of times it occurred in the found communities. Edges with the locally
highest score are defined as community cores. Membership values are assigned
to the remaining edges, based on how strongly they are related to the nearby cores.
The method, actually a framework for several possible methods, depends
heavily on the applied fitness function of the auxiliary clusters. Here the NodeLand
auxiliary clustering will be investigated. It is quite easy to engineer graphs in
the spirit of Fig. 1 which are misclustered. E.g., instead of Fig. 1a, take two
7-cliques, delete 1 link from each, and connect one node with 3 nodes from the
other clique, as in Fig. 7a. The fitness of one almost-clique, 40/7, is just below
the contribution of the node in the other clique (6/1), so the 3 links between
the cliques will be included in the community. To be precise, starting a cluster
from each node, one almost-clique + the connector node will appear as a cluster
7 + 1/3 times, and the other almost-clique 6 + 2/3 times. Fractions correspond
to different possibilities when starting from the connector node. In practice, this
means that with probability 2/3, all links will have uniform scalar values
(perfectly flat landscape, i.e. a single hilltop), and with probability 1/3, a step-like
landscape (still identified as a single cluster by the method). Symmetrization
to 6 + 1/3 + 2/3 and 6 + 2/3 + 1/3 is straightforward, by creating a second
bridge node also with 3 links. Similarly, Fig. 1b can be substituted by Fig. 7b.
It consists of two 5-cliques with connections such that each node has 2 links to
the other clique. ModuLand-NodeLand will tend to separate the two cliques,
although their union is much better separated from the rest of the graph.
(a) (b) 

Figure 7: Subgraphs on which ModuLand-NodeLand gives counterintuitive re- 
sults. Dashed lines show the desired communities. 
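
The arithmetic behind both examples can be checked directly. The sketch below rebuilds the two subgraphs under stated assumptions: which link is deleted from each 7-clique, which three nodes the connector attaches to, and the exact wiring between the two 5-cliques (node i of one clique linked to nodes i and i+1 mod 5 of the other) are arbitrary choices, since the text fixes only the counts. Fitness is taken to be the average internal degree of the cluster; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

def clique(nodes):
    """All links of a complete graph on `nodes`, as a set of frozensets."""
    return {frozenset(pair) for pair in combinations(nodes, 2)}

def fitness(cluster, edges):
    """Average internal degree: 2 * (internal links) / (cluster size)."""
    cluster = set(cluster)
    internal = sum(1 for e in edges if e <= cluster)
    return Fraction(2 * internal, len(cluster))

# Fig. 7a-like graph: two 7-cliques, one link removed from each (assumed
# choice), and node 7 of the second clique linked to 3 nodes of the first.
A, B = list(range(7)), list(range(7, 14))
edges_a = clique(A) | clique(B)
edges_a -= {frozenset((0, 1)), frozenset((8, 9))}
connector = 7
edges_a |= {frozenset((connector, a)) for a in (2, 3, 4)}

f_almost = fitness(A, edges_a)              # almost-clique alone
f_plus = fitness(A + [connector], edges_a)  # almost-clique + connector

# Fig. 7b-like graph: two 5-cliques, each node with 2 links to the other
# clique (assumed circular wiring).
C, D = list(range(5)), list(range(5, 10))
edges_b = clique(C) | clique(D)
edges_b |= {frozenset((i, 5 + j)) for i in range(5) for j in (i, (i + 1) % 5)}

f_clique = fitness(C, edges_b)          # one 5-clique
f_clique_plus = fitness(C + [5], edges_b)  # 5-clique + one outside node
f_union = fitness(C + D, edges_b)       # both cliques together

print(f_almost, f_plus)                   # 40/7 23/4
print(f_clique, f_clique_plus, f_union)   # 4 4 6
```

Under these assumptions, absorbing the connector raises the fitness from 40/7 to 23/4, whereas adding a single node of the other 5-clique leaves the fitness at 4. A growth procedure that adds one node at a time and requires a strict increase would therefore stop at the 5-clique, even though the union of the two cliques has fitness 6.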

Hierarchical methods 

Here, some hierarchical methods will be investigated. The question is whether 
the lowest level can be reliably used as an optimal partition (or cover). As a 
benchmark graph, Fig. 8 will be utilized. The desired output is a single commu- 
nity of 12 nodes, due to their extreme separation from the rest of the graph. 

Figure 8: Test case for the lowest level of the hierarchical methods. 

Method of Ruan and Zhang [68] - the proposition is to iteratively run 
modularity optimization in the found clusters, until the best modularity in- 
side a cluster is not significantly larger than that of a corresponding random 
graph. Numerical calculations show that at the lowest level, Fig. 8 is divided 
into parts.* 

* z-score is 5.3, Qmax = 0.36. The z-score is defined as the difference of the modularity of the 
actual graph and the modularity of a 0-model graph, divided by the standard deviation of the 
modularity of the 0-model graph: z-score = (Q - Q_0-model)/σ_0-model. The criterion of [68] is 
z-score > 2, Qmax > 0.3. Modularities were optimized using the Radatools software [67]. 

Method of Sales-Pardo et al. [69] - it uses the co-occurrence of nodes in 
different local optima of modularity to construct a new similarity matrix, which 
is fitted by a block diagonal form. Communities are defined by the blocks. The 
method is iteratively re-applied to each community until no structure deviating 
from a corresponding random graph is found. Again, running the method on 
Fig. 8 results in overpartitioning (the z-score of the split of Fig. 8 is 3.9, while the 
threshold used by [55] is 2.3). 
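
The two acceptance tests quoted in this section reduce to simple threshold checks. A minimal sketch follows, with hypothetical function names, using only the numbers given in the text and in the footnote (z-score > 2 and Qmax > 0.3 for [68]; a z-score threshold of 2.3 for the second test).

```python
def ruan_zhang_accepts(z_score, q_max, z_min=2.0, q_min=0.3):
    """Splitting criterion as quoted in the footnote: z-score > 2 and Qmax > 0.3."""
    return z_score > z_min and q_max > q_min

def exceeds_threshold(z_score, threshold):
    """Generic significance test: a split is kept if its z-score exceeds the threshold."""
    return z_score > threshold

# Values reported for the benchmark graph of Fig. 8.
split_by_first_test = ruan_zhang_accepts(z_score=5.3, q_max=0.36)
split_by_second_test = exceeds_threshold(z_score=3.9, threshold=2.3)
print(split_by_first_test, split_by_second_test)
```

With the reported values both checks pass, which is why the benchmark graph is split by both methods although a single community is desired.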

Hierarchical Infomap [70] - this is an extension of the Infomap method [H] . 
It is easy to calculate that, similarly to the previous cases, splitting Fig. 8 on a 
lower hierarchical level improves the partition. 

ModuLand [55] - ModuLand can also produce hierarchical structures, by it- 
eratively re-running the clustering procedure on the network of clusters (links 
between clusters are defined by node overlaps). Accordingly, the lowest level 
clusters are the ones obtained by a simple ModuLand run, which is susceptible 
to mispartitioning, as described some paragraphs above. 

OSLOM [34] - the method applies statistical significance as fitness. Although 
its output depends to a certain degree on the whole graph, running it on Fig. 
8 (as the whole graph) results in a bisection. As the method tries to find the 
so-called minimal significant clusters, by trying to split already found significant 
subgraphs while the rest of the graph is neglected, it will divide Fig. 8 indepen- 
dently of the rest of the graph. 